{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Second deduplication\n",
    "\n",
    "This notebook begins with **manifestationmeta.tsv,** and moves toward a smaller dataset that *aspires* to contain only one copy of each \"work,\" in [FRBR terminology.](https://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records) \n",
    "\n",
    "However, the key word there is \"aspires.\" We actually rely on a probabilistic model that is known to be wrong about 11% of the time. The model predicts the probability that two records are \"the same work,\" using evidence that includes the similarity of their authors and titles in metadata, but also the degree of similarity between *their texts,* as measured through cosine similarity on extracted features.\n",
    "\n",
    "I've set the probability threshold at 66% to be cautious about collapsing works together. So when the model makes an error it will usually (7%) go wrong by saying that two works are different, and more rarely (4%) mistakenly claim they are the same.\n",
    "\n",
    "**Note** this process is not completely reproducible from this notebook alone; it involves 4GB of data about extracted features that I have not uploaded to the GitHub repo. However if you consult **../get_EF** you can see how to download and process that data yourself. Fair warning: it took ~30hrs of processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from difflib import SequenceMatcher\n",
    "from collections import Counter\n",
    "import unicodedata\n",
    "import math, random, pickle\n",
    "import statsmodels.api as sm\n",
    "from scipy import spatial\n",
    "from sklearn.preprocessing import StandardScaler"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### create blocks\n",
    "\n",
    "We start by grouping volumes into \"blocks.\" This is purely a time-reduction step, to avoid useless comparisons of very different volumes. Each block is identified by the first six characters of the author's name.\n",
    "\n",
    "This strategy does unfortunately mean that the first few characters of names become very important, which is why I made some effort to standardize naming in the first deduplication notebook -- moving e.g. \"sir\" and \"mrs\" to the end of the name. More could probably be done here: names like \"Du Maurier\" and \"Van Dyck\" are potentially tricky.\n",
    "\n",
    "We group volumes in \"blocks\" identified by the first six letters of the author's name. But we also group these blocks into 26 larger groups identified by their first initial. The reason for this is that we may need to parallelize processing and divide data into chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "meta = pd.read_csv('../manifestationmeta.tsv', sep = '\\t', low_memory = False, index_col = 'docid')\n",
    "\n",
    "blocks = dict()\n",
    "\n",
    "for idx in meta.index:\n",
    "    name = meta.loc[idx, 'author']\n",
    "    if pd.isnull(name) or len(name) < 3:\n",
    "        name = 'nan'\n",
    "    else:\n",
    "        name = unicodedata.normalize('NFC', name.lower())\n",
    "    \n",
    "    if len(name) < 6:\n",
    "        blockcode = name\n",
    "    else:\n",
    "        blockcode = name[0:6]\n",
    "    \n",
    "    initial = blockcode[0]\n",
    "    if not initial.isalpha() or ord(initial) > 128:\n",
    "        initial = 'x'\n",
    "    \n",
    "    if not initial in blocks:\n",
    "        blocks[initial] = dict()\n",
    "    \n",
    "    if not blockcode in blocks[initial]:\n",
    "        blocks[initial][blockcode] = set()\n",
    "    \n",
    "    blocks[initial][blockcode].add(idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "blocks:  26\n",
      "all codes:  22346\n",
      "all volumes:  176623\n"
     ]
    }
   ],
   "source": [
    "print('blocks: ', len(blocks))\n",
    "allcodes = 0\n",
    "for b, block in blocks.items():\n",
    "    allcodes += len(block)\n",
    "print('all codes: ', allcodes)\n",
    "print('all volumes: ', len(meta.index))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dividing up the text data and creating \"group\" rows\n",
    "\n",
    "I downloaded the HathiTrust extracted features for these 176,000 volumes and parsed them using **../get_EF/parsefeaturejsons.py** to produce a matrix where each row is a volume, and the top 1000 features are columns. This ends up being 4GB of data, which is a bit whopping to manipulate in pandas. Plus, I may need to parallelize the processing on different machines. So I'm going to divvy up the matrix.\n",
    "\n",
    "While I'm doing that, I'm also going to sneakily do a couple of other things. First, I'm going to center and scale each of these matrices. (I.e., subtract column mean from each column, and divide by stddev.) The matrices won't all have exactly the same scale, but I don't think that's mission-critical.\n",
    "\n",
    "Second, I'm going to add \"group\" rows to the matrices in cases where a volume belongs to a multi-volume record. This is a tricky aspect of textual similarity. Say I'm comparing a one-volume Middlemarch from 1960 to a 3-volume edition in 1881. \"Oh,\" my program says, \"this 1960 volume doesn't match the first volume of 1881.\" Well, no, of course it doesn't, because that's just the first volume, duh! To avoid that problem, we need to create a second row that sums all the evidence for the volumes. Here we do that by taking the mean of the volumes — since we're looking at frequencies rather than absolute counts, that's roughly adequate.\n",
    "\n",
    "**Note** that this cell doesn't have to be run on every pass. Since it writes results to disk, you only have to run it till you get it right. I haven't run it on this pass."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "matrix1 = pd.read_csv('/Volumes/TARDIS/work/ef/ficmatrix/featurematrix1.csv', index_col = 'docid')\n",
    "matrix2 = pd.read_csv('../data/featurematrix.csv', index_col = 'docid')\n",
    "inmat1 = set(matrix1.index)\n",
    "inmat2 = set(matrix2.index)\n",
    "\n",
    "def probablymatch(str1, str2):\n",
    "    \n",
    "    m = SequenceMatcher(None, str1, str2)\n",
    "    match = m.real_quick_ratio()\n",
    "    if match > 0.75:\n",
    "        match = m.ratio()\n",
    "    \n",
    "    return match\n",
    "\n",
    "def find_groups(df):\n",
    "    global meta\n",
    "    groupvecs = dict()\n",
    "    \n",
    "    for d in df.index:\n",
    "        record = int(meta.loc[d, 'recordid'])\n",
    "        title = str(meta.loc[d, 'shorttitle'])\n",
    "        thisrec = meta.loc[meta.recordid == record, : ]\n",
    "        matching = []\n",
    "        \n",
    "        for idx in thisrec.index:\n",
    "            thistitle = str(thisrec.loc[idx, 'shorttitle'])\n",
    "            if thistitle == title or probablymatch(thistitle, title) > 0.9:\n",
    "                matching.append(idx)\n",
    "\n",
    "        if len(matching) > 1 and len(matching) < 6:\n",
    "            matchvec = df.loc[matching, : ].mean(axis = 0)\n",
    "            newidx = d + \"group\"\n",
    "            groupvecs[newidx] = matchvec\n",
    "    \n",
    "    return pd.DataFrame.from_dict(groupvecs, orient = 'index')\n",
    "\n",
    "for initial, block in blocks.items():\n",
    "    allvols = set()\n",
    "    for code, vols in block.items():\n",
    "        allvols = allvols.union(vols)\n",
    "    group1 = allvols.intersection(inmat1)\n",
    "    df1 = matrix1.loc[group1, : ]\n",
    "    group2 = allvols.intersection(inmat2)\n",
    "    df2 = matrix2.loc[group2, : ]\n",
    "    df = pd.concat([df1, df2])\n",
    "    print(initial, df.shape)\n",
    "    \n",
    "    # let's scale the matrix\n",
    "    scaler = StandardScaler()\n",
    "    scaler.fit(df)\n",
    "    scaled = scaler.transform(df)\n",
    "    df = pd.DataFrame(scaled, index = df.index)\n",
    "    \n",
    "    # augment the matrix with group rows\n",
    "    groupeddf = find_groups(df)\n",
    "    df = pd.concat([df, groupeddf])\n",
    "    print(df.shape)\n",
    "    print()\n",
    "    outfile = '/Volumes/TARDIS/work/ef/ficmatrix/matrix_' + initial + '.csv'\n",
    "    df.to_csv(outfile)\n",
    "        \n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training the model\n",
    "\n",
    "We need to train a model of similarity between volumes, and save the model.\n",
    "\n",
    "Let's start by reading in the relevant data. Then, let's make sure that the data uses the same definition of has_works that will be used below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "At first:  182\n",
      "After checking:  229\n"
     ]
    }
   ],
   "source": [
    "data = pd.read_csv('fulltrainingdata.tsv', sep = '\\t')\n",
    "\n",
    "def has_works(row):\n",
    "    ''' Returns 1 if either title in a pair has the word \"works\",\n",
    "    or the word \"novels.\" '''\n",
    "    \n",
    "    words1 = row.title1.lower().split()\n",
    "    words1 = [x.strip(',. ') for x in words1]\n",
    "    words2 = row.title2.lower().split()\n",
    "    words2 = [x.strip(',. ') for x in words2]\n",
    "    \n",
    "    if ('works' in words1 or 'novels' in words1) or ('works' in words2 or 'novels' in words2):\n",
    "        return 1\n",
    "    else:\n",
    "        return 0\n",
    "\n",
    "print(\"At first: \", sum(data.hasworks))\n",
    "data = data.assign(hasworks = data.apply(has_works, axis = 1))\n",
    "print(\"After checking: \", sum(data.hasworks))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That needed doing: we previously only counted \"works\" and now have expanded to \"novels.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Optimization terminated successfully.\n",
      "         Current function value: 0.288404\n",
      "         Iterations 8\n",
      "                           Logit Regression Results                           \n",
      "==============================================================================\n",
      "Dep. Variable:            groundtruth   No. Observations:                 1109\n",
      "Model:                          Logit   Df Residuals:                     1106\n",
      "Method:                           MLE   Df Model:                            2\n",
      "Date:                Sun, 06 May 2018   Pseudo R-squ.:                  0.5505\n",
      "Time:                        09:59:50   Log-Likelihood:                -319.84\n",
      "converged:                       True   LL-Null:                       -711.53\n",
      "                                        LLR p-value:                7.780e-171\n",
      "==============================================================================\n",
      "                 coef    std err          z      P>|z|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "titlematch     3.4766      0.198     17.539      0.000       3.088       3.865\n",
      "cossim        -8.0188      0.579    -13.851      0.000      -9.153      -6.884\n",
      "hasworks      -3.5458      0.334    -10.603      0.000      -4.201      -2.890\n",
      "==============================================================================\n"
     ]
    }
   ],
   "source": [
    "data.to_csv('fulltrainingdata.tsv', sep = '\\t', index = False)\n",
    "\n",
    "X = data[['titlematch', 'cossim', 'hasworks']]\n",
    "y = data['groundtruth']\n",
    "\n",
    "# Now actually train the model\n",
    "\n",
    "logit_model=sm.Logit(y,X)\n",
    "result=logit_model.fit()\n",
    "print(result.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Feature interpretation\n",
    "\n",
    "**titlematch** is the fuzzy similarity between the titles of two volumes\n",
    "\n",
    "**cossim** is the cosine similarity (or really divergence) between their texts\n",
    "\n",
    "**hasworks** is a binary variable, either 0 or 1. It's 1 for comparisons where either title contains the word \"works\" or the word \"novels.\" This turned out to be very important, because we usually don't want to consider volumes of \"Collected Works\" or \"Waverley Novels\" as a match (if they lack shorter titles). And there are a lot of such volumes!\n",
    "\n",
    "You might wonder why author similarity isn't included as a factor to predict identity. Given the quirks of our training data, the similarity of authors was actually a *negative* predictor that two volumes were the same. Basically, the authors likely to have negatives (misses) in our training sample were the common authors, with lots of collected works and close-miss titles. But we tended to have already standardized the names of common authors. So the uncommon authors, with variant names and low authormatch scores, were actually usually positive matches! But I wasn't confident that this would hold true generally, so I excluded the variable.\n",
    "\n",
    "#### checking false positives:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "fn 74\n",
      "fp 49\n",
      "error:  0.1109107303877367\n",
      "mistaken positives:  0.04418394950405771\n",
      "mistaken negatives:  0.06672678088367899\n"
     ]
    }
   ],
   "source": [
    "thresh = 0.66\n",
    "fp = 0\n",
    "fn = 0\n",
    "\n",
    "for i in range(len(y)):\n",
    "    df = X.iloc[i, : ].to_frame().transpose()\n",
    "    pred = float(result.predict(df))\n",
    "    reality = y[i]\n",
    "    if pred > thresh and reality < 0.5:\n",
    "        fp += 1\n",
    "    elif pred < thresh and reality > 0.5:\n",
    "        fn += 1\n",
    "print('fn', fn)\n",
    "print('fp', fp)\n",
    "print('error: ', (fn+fp) / len(y))\n",
    "print('mistaken positives: ', fp / len(y))\n",
    "print('mistaken negatives: ', fn / len(y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Functions that load a block and check for matches\n",
    "\n",
    "The key function here is **get_matches(),** which loops through each block, comparing each record to all the other records in the block. For each comparison, we first check a few basic thresholds (author similarity and title similarity must be > 0.8). If the connection passes those thresholds, we pass title similarity, cosine similarity of texts, and \"hasworks\" to the model.\n",
    "\n",
    "For each volume, we keep a record of all the other volumes that match it. We can later transform this dictionary of *edges* into a list of *connected components*.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def probablymatch(str1, str2):\n",
    "    '''Runs a quick check, and a better check if the upper bound on\n",
    "    quick check suggests a better check is needed.'''\n",
    "    m = SequenceMatcher(None, str1, str2)\n",
    "    match = m.real_quick_ratio()\n",
    "    if match > 0.75:\n",
    "        match = m.ratio()\n",
    "    \n",
    "    return match\n",
    "\n",
    "def has_works(title1, title2):\n",
    "    ''' Returns 1 if either title in a pair has the word \"works\",\n",
    "    or the word \"novels.\" '''\n",
    "    \n",
    "    words1 = title1.lower().split()\n",
    "    words1 = set([x.strip(',. ') for x in words1])\n",
    "    words2 = title2.lower().split()\n",
    "    words2 = set([x.strip(',. ') for x in words2])\n",
    "    \n",
    "    if ('works' in words1 or 'novels' in words1) or ('works' in words2 or 'novels' in words2):\n",
    "        return 1\n",
    "    else:\n",
    "        return 0\n",
    "\n",
    "def cleanstring(astring, cap):\n",
    "    astring = astring.replace(';', '')\n",
    "    astring = astring.replace(':', '')\n",
    "    astring = astring.lower()\n",
    "    if len(astring) > cap:\n",
    "        astring = astring[0 : cap]\n",
    "    return astring\n",
    "\n",
    "def calculate_cossim(doc1, doc2, df, inmatrix):\n",
    "    ''' Calculates cosine similarity between two volumes, and between the\n",
    "    larger groups of vols they belong to, if those groups exist.\n",
    "    '''\n",
    "    if doc1 in inmatrix and doc2 in inmatrix:\n",
    "        vec1 = df.loc[doc1, : ]\n",
    "        vec2 = df.loc[doc2, : ]\n",
    "        cossimA = spatial.distance.cosine(vec1, vec2)\n",
    "        \n",
    "        doc1groupidx = doc1 + 'group'\n",
    "        doc2groupidx = doc2 + 'group'\n",
    "        \n",
    "        if doc1groupidx in inmatrix:\n",
    "            grouped1 = df.loc[doc1groupidx, : ]\n",
    "            cossimB = spatial.distance.cosine(grouped1, vec2)\n",
    "        else:\n",
    "            cossimB = 100\n",
    "        \n",
    "        if doc2groupidx in inmatrix:\n",
    "            grouped2 = df.loc[doc2groupidx, : ]\n",
    "            cossimC = spatial.distance.cosine(vec1, grouped2)\n",
    "        else:\n",
    "            cossimC = 100\n",
    "        \n",
    "        if cossimB < 100 and cossimC < 100:\n",
    "            cossimD = spatial.distance.cosine(grouped1, grouped2)\n",
    "        else:\n",
    "            cossimD = 100\n",
    "        \n",
    "        cossim = min(cossimA, cossimB, cossimC, cossimD)\n",
    "        \n",
    "    else:\n",
    "        cossim = 0.2151\n",
    "        # This was the mean in our training set, and will be used in\n",
    "        # place of NA for comparisons where either vol is missing.\n",
    "    \n",
    "    return cossim\n",
    "    \n",
    "def get_matches(initial, blocks, model):\n",
    "    \n",
    "    block = blocks[initial]\n",
    "    \n",
    "    # get the text data for this block\n",
    "    dataname = '/Volumes/TARDIS/work/ef/ficmatrix/matrix_' + initial + '.csv'\n",
    "    textmatrix = pd.read_csv(dataname, index_col = 'docid')\n",
    "    inmatrix = set(textmatrix.index)\n",
    "    \n",
    "    matches = dict()\n",
    "    repeats = 0\n",
    "    \n",
    "    for code, volset in block.items():\n",
    "        \n",
    "        vols = list(volset)\n",
    "        \n",
    "        already_checked = dict()\n",
    "        titledict = dict()\n",
    "        authdict = dict()\n",
    "    \n",
    "        # we clean all the titles and authors in the vols before \n",
    "        # attempting to match; otherwise you end up doing\n",
    "        # n x n cleaning operations.\n",
    "        \n",
    "        # we also initialize matches\n",
    "    \n",
    "        for b in vols:\n",
    "            if b not in matches:\n",
    "                matches[b] = set()\n",
    "            else:\n",
    "                repeats += 1\n",
    "                # that shouldn't happen\n",
    "                \n",
    "            auth = meta.loc[b, 'author']\n",
    "            if pd.isnull(auth) or len(auth) < 3:\n",
    "                auth = 'cannot-match'\n",
    "            else:\n",
    "                auth = cleanstring(auth, 25)\n",
    "\n",
    "            title = meta.loc[b, 'shorttitle']\n",
    "            if pd.isnull(title) or len(title) < 3:\n",
    "                title = 'cannot-match'\n",
    "            else:\n",
    "                title = cleanstring(title, 35)\n",
    "\n",
    "            titledict[b] = title\n",
    "            authdict[b] = auth\n",
    "\n",
    "        for idx, b1 in enumerate(vols):\n",
    "            \n",
    "            for b2 in vols[idx + 1 : ]:\n",
    "                \n",
    "                auth1 = authdict[b1]\n",
    "                auth2 = authdict[b2]\n",
    "                title1 = titledict[b1]\n",
    "                title2 = titledict[b2]\n",
    "\n",
    "                if auth1 == 'cannot-match' or auth2 == 'cannot-match':\n",
    "                    continue\n",
    "                    \n",
    "                if title1 == 'cannot-match' or title2 == 'cannot-match':\n",
    "                    continue\n",
    "\n",
    "                if auth1 == auth2:\n",
    "                    authormatch = 1.0\n",
    "                else:\n",
    "                    authormatch = probablymatch(auth1, auth2)\n",
    "                    if authormatch < 0.8:\n",
    "                        # we insist on more similarity in authors\n",
    "                        continue\n",
    "\n",
    "                if title1 == title2:\n",
    "                    titlematch = 1.0\n",
    "                else:\n",
    "                    titlematch = probablymatch(title1, title2)\n",
    "                    if titlematch < 0.8:\n",
    "                        # we insist on more similarity in titles\n",
    "                        continue\n",
    "\n",
    "                cossim = calculate_cossim(b1, b2, textmatrix, inmatrix)\n",
    "                hasworks = has_works(title1, title2)\n",
    "\n",
    "                testdf = pd.DataFrame({'titlematch': titlematch, 'cossim': cossim, 'hasworks': hasworks}, index = ['test'], dtype = 'float64')\n",
    "                testdf = testdf[['titlematch', 'cossim', 'hasworks']]\n",
    "                probability = float(model.predict(testdf))\n",
    "                \n",
    "                if probability < 0.66:\n",
    "                    continue\n",
    "                    # this threshold (more demanding than 0.5) was tested above\n",
    "                    # under \"checking false positives.\"\n",
    "                    \n",
    "                else:\n",
    "                    matches[b1].add(b2)\n",
    "                    matches[b2].add(b1)\n",
    "                    \n",
    "    if repeats > 0:\n",
    "        print('repeats ', repeats)\n",
    "    return matches              "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Functions that connect components\n",
    "\n",
    "The previous function gave us a dictionary where each volume is linked to a set of volumes that match it. This is in essence a data structure of *edges* in a graph.\n",
    "\n",
    "Now we need to transform that structure into a list of *connected components*. Basically, like so:\n",
    "\n",
    "![caption](files/connected.png)\n",
    "\n",
    "Image credit: [Sebastian Thomas.](https://www.mathworks.com/matlabcentral/fileexchange/46457-splitting-a-network-into-connected-components)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def dfs(vertex, matchdict, visited, components, component_ctr):\n",
    "    ''' Depth-first search algorithm. '''\n",
    "    visited.add(vertex)\n",
    "    components[component_ctr].add(vertex)\n",
    "    for link in matchdict[vertex]:\n",
    "        if link not in visited:\n",
    "            dfs(link, matchdict, visited, components, component_ctr)\n",
    "            \n",
    "def connect_components(matchdict):\n",
    "    ''' Visit each vertex. If not yet visited, create a new component, and do \n",
    "    depth-first search on the vertex, adding all linked vertices to the new\n",
    "    component.\n",
    "    '''\n",
    "    \n",
    "    visited = set()\n",
    "    components = []\n",
    "    component_ctr = 0\n",
    "    \n",
    "    for vertex, links in matchdict.items():\n",
    "        if vertex not in visited:\n",
    "            components.append(set())\n",
    "            dfs(vertex, matchdict, visited, components, component_ctr)\n",
    "            component_ctr += 1\n",
    "    \n",
    "    return components"
   ]
  },
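  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, here's a toy example (hypothetical docids) of how **connect_components()** collapses a dictionary of edges into components:\n",
    "\n",
    "```python\n",
    "toy_matches = {\n",
    "    'volA': {'volB'},           # A and B match each other\n",
    "    'volB': {'volA', 'volC'},   # B also matches C\n",
    "    'volC': {'volB'},\n",
    "    'volD': set()               # an isolated volume\n",
    "}\n",
    "print(connect_components(toy_matches))\n",
    "# [{'volA', 'volB', 'volC'}, {'volD'}]\n",
    "```"
   ]
  },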
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "q | u | y | z | i | x | j | v | o | e | n | f | w | t | r | p | a | k | g | l | d | h | c | b | m | s\n"
     ]
    }
   ],
   "source": [
    "# I prefer to go through the blocks from smallest to largest so that I can see whether\n",
    "# the function works without waiting forever.\n",
    "\n",
    "initialist = []\n",
    "for initial, block in blocks.items():\n",
    "    initialist.append((len(block), initial))\n",
    "\n",
    "initialist.sort()\n",
    "initialist = [x[1] for x in initialist]\n",
    "print(' | '.join(initialist))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Actually using the functions above to find groups\n",
    "\n",
    "The list of groups will be stored in a variable named **components.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "q 62 | u 135 | y 148 | z 175 | i 245 | x 353 | j 358 | v 460 | o 538 | e 547 | n 548 | f 730 | w 777 | t 853 | r 953 | p 1025 | a 1038 | k 1096 | g 1115 | l 1177 | d 1233 | h 1297 | c 1450 | b 1988 | m 1992 | s 2053 | "
     ]
    }
   ],
   "source": [
    "components = []\n",
    "for initial in initialist:\n",
    "    print(initial, len(blocks[initial]), end = ' | ')\n",
    "    matches = get_matches(initial, blocks, result)\n",
    "    components.extend(connect_components(matches))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A little exploratory description\n",
    "\n",
    "E.g., how many groups do we have? How big is the biggest?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "We have 131538 different components,\n",
      "of which the biggest contains 201 vols.\n",
      "\n",
      "Here's a member of that group: \n",
      "oldauthor                                           Defoe, Daniel\n",
      "author                                              Defoe, Daniel\n",
      "authordate                                            1661?-1731.\n",
      "inferreddate                                                 1891\n",
      "latestcomp                                                   1731\n",
      "datetype                                                        s\n",
      "startdate                                                    1891\n",
      "enddate                                                          \n",
      "imprint                                   London;T.F. Unwin;1891.\n",
      "imprintdate                                                  1891\n",
      "contents                                                      NaN\n",
      "genres                                               UnknownGenre\n",
      "subjects                                                      NaN\n",
      "geographics                                                   NaN\n",
      "locnum                                                        NaN\n",
      "oclc                                                     37198907\n",
      "place                                                         enk\n",
      "recordid                                                  1369806\n",
      "enumcron                                                      NaN\n",
      "volnum                                                        NaN\n",
      "title           The adventures of Robinson Crusoe, | $c: by Da...\n",
      "parttitle                                                     NaN\n",
      "shorttitle                      The adventures of Robinson Crusoe\n",
      "instances                                                       1\n",
      "Name: mdp.39015078552018, dtype: object\n"
     ]
    }
   ],
   "source": [
    "print('We have ' + str(len(components)) + \" different components,\")\n",
    "\n",
    "maxsize = 0\n",
    "for c in components:\n",
    "    if len(c) > maxsize:\n",
    "        maxsize = len(c)\n",
    "        for ex_biggest in c:\n",
    "            break\n",
    "print(\"of which the biggest contains \" + str(maxsize) + \" vols.\")\n",
    "print()\n",
    "print(\"Here's a member of that group: \")\n",
    "print(meta.loc[ex_biggest, : ])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now the actual deduplication\n",
    "\n",
    "In principle, generally, we want to take one volume from each group of volumes that have matching titles and authors. And in general we want to take the earliest volume, so our resulting dataset will be dated as close as possible to dates of first publication.\n",
    "\n",
    "However, there are complicating cases. What if, for instance, the earliest instance of a novel is a Victorian three-decker edition? That's going to happen pretty often. In that case, we don't want to take *just one volume* from the group; we want all three volumes of the earliest edition. So we need a new rule: take all volumes sharing the *recordid* of the earliest volume. That will get all three volumes of a three-volume edition.\n",
    "\n",
    "But we confront yet another complication! Volumes grouped by a recordid are sometimes three volumes of a single work. But often they are, say, 28 volumes in the *Collected Works of Scott.* All sharing a single record id, but not all the same fictional work. Maybe some of the longer novels are spread across 2 or three volumes, but many of the volumes represent a single novel. This gets bloody complicated.\n",
    "\n",
    "So our *new* rule is: find the earliest volume. Get its record id. Find all volumes sharing that record id (all volumes in the same set). Then take all the volumes that share the same *short title*. If we have been able to identify vols 11 and 12 as *Ivanhoe,* this will get just 11 and 12. However, if we haven't been able to identify titles beyond *Collected Works of Scott,* we'll get all 28 vols! So the final rule is, ignore cases where we recover more than five vols sharing the same recordid. We suspect these are collected works.\n",
    "\n",
    "As we do this, we are going to want to keep track of the number of copies of a volume that have been collapsed into a single deduplicated record. We'll use a column of \"instances\" created in the earlier stage of deduplication; this counts vols that had the same recordid+volnum. We'll further aggregate that into \"copies\": vols that had the same author/title. Moreover, since we may want to distinguish *contemporary* popularity from later canonicity, we're going to keep track of this in two different ways: a general column of copies and a column of copies-published-within-25-yrs of our first example."
   ]
  },
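  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a minimal sketch of that rule on toy metadata (the docids and record ids are hypothetical): the earliest volume's recordid pulls in its whole three-volume first edition, while a group that recovered more than five title-matched volumes would be set aside as a probable collected works.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "from difflib import SequenceMatcher\n",
    "\n",
    "# toy metadata: a three-decker first edition (record 111) and a\n",
    "# one-volume 1960 reprint (record 222) of the same work\n",
    "toy = pd.DataFrame({\n",
    "    'recordid':     [111, 111, 111, 222],\n",
    "    'inferreddate': [1881, 1881, 1881, 1960],\n",
    "    'shorttitle':   ['Middlemarch'] * 4,\n",
    "}, index = ['v1', 'v2', 'v3', 'v4'])\n",
    "\n",
    "group = {'v1', 'v2', 'v3', 'v4'}   # one connected component of matches\n",
    "earliest = min(group, key = lambda d: toy.loc[d, 'inferreddate'])\n",
    "record = toy.loc[earliest, 'recordid']\n",
    "\n",
    "# take every volume on the earliest record whose short title matches\n",
    "samerec = toy[toy.recordid == record]\n",
    "matching = [d for d in samerec.index\n",
    "            if SequenceMatcher(None, toy.loc[earliest, 'shorttitle'],\n",
    "                               toy.loc[d, 'shorttitle']).ratio() > 0.9]\n",
    "\n",
    "if len(matching) < 6:   # more than five would suggest a collected works\n",
    "    print(sorted(matching))   # ['v1', 'v2', 'v3']\n",
    "```"
   ]
  },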
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "10001\n",
      "20001\n",
      "30001\n",
      "40001\n",
      "50001\n",
      "60001\n",
      "70001\n",
      "80001\n",
      "90001\n",
      "100001\n",
      "110001\n",
      "120001\n",
      "130001\n",
      "0\n"
     ]
    }
   ],
   "source": [
    "selected = []\n",
    "ignored = []\n",
    "errors = 0\n",
    "authtitlecopies = dict()\n",
    "copiesin25yrs = dict()\n",
    "authorsets = []\n",
    "\n",
    "ctr = 0\n",
    "for g in components:\n",
    "    ctr += 1\n",
    "    if ctr % 10000 == 1:\n",
    "        print(ctr)\n",
    "    \n",
    "    # Some groups contain only a single volume.\n",
    "    if len(g) == 1:\n",
    "        for e in g:\n",
    "            break\n",
    "        selected.append(e)\n",
    "        authtitlecopies[e] = int(meta.loc[e, 'instances'])\n",
    "        copiesin25yrs[e] = authtitlecopies[e]\n",
    "        # For a single volume, all these quantities will be the same.\n",
    "        continue\n",
    "        \n",
    "    if len(g) < 1:\n",
    "        errors += 1\n",
    "        continue\n",
    "    \n",
    "    earliest = ''\n",
    "    earliestdate = 2100\n",
    "    instancectr = Counter()\n",
    "    authorset = set()\n",
    "    \n",
    "    for element in g:\n",
    "        date = meta.loc[element, 'inferreddate']\n",
    "        copies = int(meta.loc[element, 'instances'])\n",
    "        auth = meta.loc[element, 'author']\n",
    "        if not pd.isnull(auth):\n",
    "            authorset.add(auth)\n",
    "        \n",
    "        if pd.isnull(date) or int(date) == 0:\n",
    "            date = 2100\n",
    "        else:\n",
    "            date = int(date)\n",
    "        \n",
    "        instancectr[date] += copies\n",
    "        \n",
    "        if earliestdate == 2100 or date < earliestdate:\n",
    "            earliestdate = date\n",
    "            earliest = element\n",
    "            if earliestdate < 1700:\n",
    "                earliestdate = 2100\n",
    "                # don't reward dubious dates\n",
    "                \n",
    "    # different authnames?\n",
    "    if len(authorset) > 1:\n",
    "        authorsets.append(authorset)\n",
    "        \n",
    "    # now let's add up those copies\n",
    "    allcopies = 0\n",
    "    copiesin25yrsofearliest = 0\n",
    "    \n",
    "    for date, count in instancectr.items():\n",
    "        allcopies += count\n",
    "        if date < (earliestdate + 25):\n",
    "            copiesin25yrsofearliest += count\n",
    "            \n",
    "    record = meta.loc[earliest, 'recordid']\n",
    "    title2match = str(meta.loc[earliest, 'shorttitle'])\n",
    "\n",
    "    matching = []\n",
    "\n",
    "    thisrec = meta.loc[meta.recordid == record, : ]\n",
    "    for idx in thisrec.index:\n",
    "        thistitle = str(thisrec.loc[idx, 'shorttitle'])\n",
    "        match = probablymatch(title2match, thistitle)\n",
    "        if match > 0.9:\n",
    "            matching.append(idx)\n",
    "    \n",
    "    if len(matching) < 6:\n",
    "        selected.extend(matching)\n",
    "        for m in matching:\n",
    "            authtitlecopies[m] = allcopies\n",
    "            copiesin25yrs[m] = copiesin25yrsofearliest\n",
    "    else:\n",
    "        ignored.append((title2match, record))\n",
    "        \n",
    "print(errors)          "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Some exploratory description\n",
    "\n",
    "For instance, how many records did we select? How many groups of vols were ignored?\n",
    "\n",
    "Note also that I quietly prune duplicate docids from the **selected** list. The algorithm above permits some duplication to happen, though it's not huge.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "138160\n",
      "138137\n"
     ]
    }
   ],
   "source": [
    "print(len(selected))\n",
    "\n",
    "# get rid of duplicates\n",
    "selected = list(set(selected))\n",
    "print(len(selected))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "204\n",
      "401\n"
     ]
    }
   ],
   "source": [
    "print(len(ignored))\n",
    "print(len(authorsets))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's write the ignored records to file\n",
    "\n",
    "with open('ignoredgroups.tsv', mode = 'w', encoding = 'utf-8') as f:\n",
    "    for title, record in ignored:\n",
    "        f.write(title + '\\t' + str(record) + '\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Also the groups\n",
    "\n",
    "with open('allgroups.tsv', mode = 'w', encoding = 'utf-8') as f:\n",
    "    for g in components:\n",
    "        f.write('\\t'.join(g) + '\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "300\n"
     ]
    }
   ],
   "source": [
    "# And the authorsets\n",
    "\n",
    "authorsets = set([tuple(x) for x in authorsets])\n",
    "print(len(authorsets))\n",
    "# reduce duplication\n",
    "\n",
    "with open('authorsets.tsv', mode = 'w', encoding = 'utf-8') as f:\n",
    "    for s in authorsets:\n",
    "        f.write('\\t'.join(s) + '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now actually produce and write the dataframe\n",
    "\n",
    "All of our effort so far has gone into selecting a list of indices that will be retained. Now we have to use those indices to actually produce a new dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# like so\n",
    "\n",
    "deduped = meta.loc[selected, : ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>oldauthor</th>\n",
       "      <th>author</th>\n",
       "      <th>authordate</th>\n",
       "      <th>inferreddate</th>\n",
       "      <th>latestcomp</th>\n",
       "      <th>datetype</th>\n",
       "      <th>startdate</th>\n",
       "      <th>enddate</th>\n",
       "      <th>imprint</th>\n",
       "      <th>imprintdate</th>\n",
       "      <th>...</th>\n",
       "      <th>locnum</th>\n",
       "      <th>oclc</th>\n",
       "      <th>place</th>\n",
       "      <th>recordid</th>\n",
       "      <th>enumcron</th>\n",
       "      <th>volnum</th>\n",
       "      <th>title</th>\n",
       "      <th>parttitle</th>\n",
       "      <th>shorttitle</th>\n",
       "      <th>instances</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>docid</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>uc1.b4089311</th>\n",
       "      <td>Price, Reynolds</td>\n",
       "      <td>Price, Reynolds</td>\n",
       "      <td>1933-2011.</td>\n",
       "      <td>1963</td>\n",
       "      <td>1963</td>\n",
       "      <td>s</td>\n",
       "      <td>1963</td>\n",
       "      <td></td>\n",
       "      <td>New York|Atheneum|1963.</td>\n",
       "      <td>1963</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>291906</td>\n",
       "      <td>nyu</td>\n",
       "      <td>1029196</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The names and faces of heroes.</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The names and faces of heroes</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uc1.b248388</th>\n",
       "      <td>Gingold, H??l??ne</td>\n",
       "      <td>Gingold, H??l??ne</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1893</td>\n",
       "      <td>1893</td>\n",
       "      <td>s</td>\n",
       "      <td>1893</td>\n",
       "      <td></td>\n",
       "      <td>London;Sydney;Remington;1893.</td>\n",
       "      <td>1893</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>22190611</td>\n",
       "      <td>enk</td>\n",
       "      <td>6501321</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Seven stories</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Seven stories</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mdp.39015077654393</th>\n",
       "      <td>Morales, Federico</td>\n",
       "      <td>Morales, Federico</td>\n",
       "      <td>1980-</td>\n",
       "      <td>2008</td>\n",
       "      <td>2008</td>\n",
       "      <td>s</td>\n",
       "      <td>2008</td>\n",
       "      <td></td>\n",
       "      <td>Richmond, B.C.|FreedRow Pub.|2008.</td>\n",
       "      <td>2008</td>\n",
       "      <td>...</td>\n",
       "      <td>PR9199.4.M67D39 2008</td>\n",
       "      <td>222518802</td>\n",
       "      <td>bcc</td>\n",
       "      <td>5812142</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Family, friends, and lovers / | $c: Federico M...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Family, friends, and lovers</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uc1.$b391262</th>\n",
       "      <td>Brophy, Brigid</td>\n",
       "      <td>Brophy, Brigid</td>\n",
       "      <td>1929-1995.</td>\n",
       "      <td>1970</td>\n",
       "      <td>1970</td>\n",
       "      <td>r</td>\n",
       "      <td>1970</td>\n",
       "      <td></td>\n",
       "      <td>New York|Putnam|1970, c1969</td>\n",
       "      <td>1970</td>\n",
       "      <td>...</td>\n",
       "      <td>PZ4.B8735In3PR6052.R583</td>\n",
       "      <td>51109</td>\n",
       "      <td>nyu</td>\n",
       "      <td>9447530</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>In transit; | an heroi-cyclic novel.</td>\n",
       "      <td>NaN</td>\n",
       "      <td>In transit; an heroi-cyclic novel</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uc1.b3144885</th>\n",
       "      <td>Baker, Karle Wilson</td>\n",
       "      <td>Baker, Karle Wilson</td>\n",
       "      <td>1878-1960.</td>\n",
       "      <td>1923</td>\n",
       "      <td>1923</td>\n",
       "      <td>s</td>\n",
       "      <td>1923</td>\n",
       "      <td></td>\n",
       "      <td>New Haven|Yale University Press; [etc., etc.|1...</td>\n",
       "      <td>1923</td>\n",
       "      <td>...</td>\n",
       "      <td>PS3503.A5435O6 1923</td>\n",
       "      <td>1059608</td>\n",
       "      <td>ctu</td>\n",
       "      <td>6110253</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Old coins, | $c: by Karle Wilson Baker.</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Old coins</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 24 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                              oldauthor               author  authordate  \\\n",
       "docid                                                                      \n",
       "uc1.b4089311            Price, Reynolds      Price, Reynolds  1933-2011.   \n",
       "uc1.b248388           Gingold, H??l??ne    Gingold, H??l??ne         NaN   \n",
       "mdp.39015077654393    Morales, Federico    Morales, Federico       1980-   \n",
       "uc1.$b391262             Brophy, Brigid       Brophy, Brigid  1929-1995.   \n",
       "uc1.b3144885        Baker, Karle Wilson  Baker, Karle Wilson  1878-1960.   \n",
       "\n",
       "                    inferreddate  latestcomp datetype startdate enddate  \\\n",
       "docid                                                                     \n",
       "uc1.b4089311                1963        1963        s      1963           \n",
       "uc1.b248388                 1893        1893        s      1893           \n",
       "mdp.39015077654393          2008        2008        s      2008           \n",
       "uc1.$b391262                1970        1970        r      1970           \n",
       "uc1.b3144885                1923        1923        s      1923           \n",
       "\n",
       "                                                              imprint  \\\n",
       "docid                                                                   \n",
       "uc1.b4089311                                  New York|Atheneum|1963.   \n",
       "uc1.b248388                             London;Sydney;Remington;1893.   \n",
       "mdp.39015077654393                 Richmond, B.C.|FreedRow Pub.|2008.   \n",
       "uc1.$b391262                              New York|Putnam|1970, c1969   \n",
       "uc1.b3144885        New Haven|Yale University Press; [etc., etc.|1...   \n",
       "\n",
       "                   imprintdate    ...                       locnum       oclc  \\\n",
       "docid                             ...                                           \n",
       "uc1.b4089311              1963    ...                          NaN     291906   \n",
       "uc1.b248388               1893    ...                          NaN   22190611   \n",
       "mdp.39015077654393        2008    ...         PR9199.4.M67D39 2008  222518802   \n",
       "uc1.$b391262              1970    ...      PZ4.B8735In3PR6052.R583      51109   \n",
       "uc1.b3144885              1923    ...          PS3503.A5435O6 1923    1059608   \n",
       "\n",
       "                   place recordid enumcron volnum  \\\n",
       "docid                                               \n",
       "uc1.b4089311         nyu  1029196      NaN    NaN   \n",
       "uc1.b248388          enk  6501321      NaN    NaN   \n",
       "mdp.39015077654393   bcc  5812142      NaN    NaN   \n",
       "uc1.$b391262         nyu  9447530      NaN    NaN   \n",
       "uc1.b3144885         ctu  6110253      NaN    NaN   \n",
       "\n",
       "                                                                title  \\\n",
       "docid                                                                   \n",
       "uc1.b4089311                           The names and faces of heroes.   \n",
       "uc1.b248388                                             Seven stories   \n",
       "mdp.39015077654393  Family, friends, and lovers / | $c: Federico M...   \n",
       "uc1.$b391262                     In transit; | an heroi-cyclic novel.   \n",
       "uc1.b3144885                  Old coins, | $c: by Karle Wilson Baker.   \n",
       "\n",
       "                    parttitle                         shorttitle  instances  \n",
       "docid                                                                        \n",
       "uc1.b4089311              NaN      The names and faces of heroes          2  \n",
       "uc1.b248388               NaN                      Seven stories          2  \n",
       "mdp.39015077654393        NaN        Family, friends, and lovers          1  \n",
       "uc1.$b391262              NaN  In transit; an heroi-cyclic novel          1  \n",
       "uc1.b3144885              NaN                          Old coins          1  \n",
       "\n",
       "[5 rows x 24 columns]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "deduped.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### add copy counts\n",
    "\n",
    "Before we write out the dataframe, add columns reflecting the number of copies collapsed into each record."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_copy_count(idx, dictionary):\n",
    "    return dictionary[idx]\n",
    "\n",
    "deduped = deduped.assign(allcopiesofwork = deduped.apply(lambda row: get_copy_count(row.name, authtitlecopies), axis = 1))\n",
    "deduped = deduped.assign(copiesin25yrs = deduped.apply(lambda row: get_copy_count(row.name, copiesin25yrs), axis = 1))\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['oldauthor', 'author', 'authordate', 'inferreddate', 'latestcomp',\n",
      "       'datetype', 'startdate', 'enddate', 'imprint', 'imprintdate',\n",
      "       'contents', 'genres', 'subjects', 'geographics', 'locnum', 'oclc',\n",
      "       'place', 'recordid', 'enumcron', 'volnum', 'title', 'parttitle',\n",
      "       'shorttitle', 'instances', 'allcopiesofwork', 'copiesin25yrs'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "print(deduped.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sort rows\n",
    "deduped.sort_values(by = ['inferreddate', 'recordid', 'volnum'], inplace = True)\n",
    "\n",
    "# put columns in desired order (title last)\n",
    "deduped = deduped[['oldauthor', 'author', 'authordate', 'inferreddate',\n",
    "       'latestcomp', 'datetype', 'startdate', 'enddate', 'imprint',\n",
    "       'imprintdate', 'contents', 'genres', 'subjects', 'geographics',\n",
    "       'locnum', 'oclc', 'place', 'recordid', 'instances', 'allcopiesofwork',\n",
    "       'copiesin25yrs', 'enumcron', 'volnum', 'title',\n",
    "       'parttitle', 'shorttitle']]\n",
    "\n",
    "# write to file\n",
    "deduped.to_csv('newworkmeta.tsv', sep = '\\t', index = True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
